Constructions represented in parallel with structural and lexical items for identifying presence of attitude robustly across domains in text

Authors

  • Jussi Karlgren
  • Gunnar Eriksson
  • Magnus Sahlgren
  • Oscar Täckström
Abstract

This paper describes experiments to use non-terminological information to find attitudinal expressions in written English text. The experiments are based on an analysis of text with respect not only to the vocabulary of content terms present in it (which most other approaches use as a basis for analysis) but also to structural features of the text, as represented by the presence of form words (in other approaches often removed by stop lists) and by the presence of constructional features (typically disregarded by most other analyses). In our analysis, following a construction grammar framework, structural features are treated as occurrences, similarly to the treatment of vocabulary features. The constructional features in play are chosen to potentially signify opinion but are not specific to negative or positive expressions. The framework is used to classify clauses, headlines, and sentences from three different shared collections of attitudinal data. We find that constructional features transfer well and show potential for generalisation across different text collections.

1. ATTITUDE ANALYSIS IS MOSTLY BASED ON LEXICAL STATISTICS

Attitude analysis, a subtask of information refinement from texts, has gained interest in recent years, both for its application potential and for the promise of shedding new light on hitherto unformalised aspects of human language usage: the expression of attitude, opinion, or sentiment is a quintessentially human activity. It is not explicitly conventionalised to the degree that many other aspects of language usage are. Most attempts to identify attitudinal expression in text have been based on lexical factors. Resources such as SentiWordNet or the General Inquirer lexicon are utilised, or similar resources developed, by most research groups engaged in attitude analysis tasks.[3, 12] But attitude is not a solely lexical matter.
Expressions with identical or near-identical terms can be more or less attitudinal by virtue of their form; combinations of fairly attitudinally loaded terms may lack attitudinal power; certain terms considered neutral in typical language use can have strong attitudinal loading in certain discourses or at certain times. Our approach takes as its starting point the observation that lexical resources are always noisy and out of date, and most often suffer from being both too specific and too general at once. Not only are lexical resources inherently somewhat unreliable or costly to maintain, but they do not cover all the possibilities of expression afforded by human linguistic behaviour: we believe that attitudinal expression in text is not solely a lexical issue. We have previously tested resource-thrifty approaches for annotation of textual materials, arguing that general purpose linguistic analysis together with appropriate background materials for training a general language model provides a more general, more portable, and more robust methodology for extracting information from text.[?] (footnote 1)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 2009, Boston, USA. Copyright 2009 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

2. CONSTRUCTIONS AS CHARACTERISTIC FEATURES OF UTTERANCES

Most categorisation features used for any type of text or text snippet categorisation are term occurrence based.
Utterances are seen as sequences or bags of words, “w1 w2 ... wi ... wj ... wn”, and the observations of the words w are subjected to frequency or occurrence analyses to yield features such as:

  • frequency features: is some term wi unusually frequent or infrequent?
  • cooccurrence features: is some combination of terms wi, wj unusually frequent or infrequent?
  • equivalence classes: can some term wi be substituted or generalised to a class or concept marker?

Our hypotheses are that investigating utterances for the presence of content-bearing words may be useful for identifying attitudinal expressions, but that structural features carry over more easily from one topical area to another and from one discourse to another. We view utterances as more than the words that appear in them. The pattern of an utterance is an observable item in itself. It has previously been suggested that attitude in text is carried by dependencies among words, rather than by keywords, cue phrases, or high-frequency words.[1] We agree, but in contrast with previous work, we explicitly incorporate constructions in our knowledge representation, not as relations between terms but as features in their own right.[8]

This paper describes an experiment to investigate the attitudinal power of linguistic constructions in utterances. It compares the effect of constructional features by using a test set together with a reasonably chosen background text collection, and then using the same method on a test set with different topical content. For the experiments reported in this paper no attitudinal lexical resources were used: only general purpose linguistic analysis was employed to establish the constructions used in the further processes.

Footnote 1: Reference and details of experiment omitted for review.

3. FEATURE SETS: TERMS AND CONSTRUCTIONS

The texts used in this experiment are viewed as sequences of sentences: the sentence is taken as the basis of analysis, as a proxy for the utterance we view as the basis for attitudinal expression. All texts in this experiment are preprocessed by a linguistic analysis toolkit, resulting in a lexical categorisation of each word and a full dependency parse for each sentence. From that analysis, three types of features are extracted to represent sentences: content words (I), function words (F) and construction markers (K).

3.1 Content and function words

All words that are assigned a content part-of-speech category by the lexical analysis are considered members of the content word (I) class, and the base forms of such words are used as I features when they occur in a sentence. All further words in a sentence, belonging to the remaining part-of-speech classes (footnote 4), are judged function words, and their base forms are used as F features in the sentence representation.

3.2 Construction markers

Besides word occurrence based feature classes we introduce a further feature class intended to capture aspects of the constructions employed in the sentence. Some of these constructional features (K) concern clause semantics and sentence or clause structure, such as the transitivity of the clauses in the sentence, the occurrence of objective that-clauses or relative clauses, the occurrence of predicate constructions, the occurrence of manner, spatial, and temporal adverbials, etc. Other construction markers concern morphological features such as tense forms of verbs present in the sentence or the degree of comparison of occurring adjectives. As in the case of the word-based features, these features are extracted from the linguistic analysis.
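The sentence representation of Section 3 can be illustrated with a minimal sketch. It assumes the parser output is already available as base forms with part-of-speech labels plus a list of construction markers. The tag names, the `Token` class, the `sentence_features` function, and the `I:`/`F:`/`K:` prefixes are hypothetical names chosen for illustration and are not taken from the paper or from the parser.

```python
from dataclasses import dataclass

# Hypothetical part-of-speech labels; the parser's real tag set is not given here.
# They stand for nouns, adjectives, verbs, adverbs, abbreviations, numerals,
# interjections, and negation, i.e. the classes treated as content words.
CONTENT_POS = {"N", "A", "V", "ADV", "ABBR", "NUM", "INTERJ", "NEG"}


@dataclass(frozen=True)
class Token:
    lemma: str  # base form of the word
    pos: str    # part-of-speech label (hypothetical tag names)


def sentence_features(tokens, constructions):
    """Return one flat set of I, F and K features for a sentence.

    All three feature types are plain sentence-level occurrences:
    nothing links a K feature to the word that carries it.
    """
    features = set()
    for tok in tokens:
        kind = "I" if tok.pos in CONTENT_POS else "F"
        features.add(f"{kind}:{tok.lemma}")
    for marker in constructions:
        features.add(f"K:{marker}")
    return features


# Toy input, assumed to come from an earlier parsing step.
tokens = [
    Token("he", "PRON"), Token("say", "V"), Token("that", "CS"),
    Token("peace", "N"), Token("be", "V"), Token("a", "DET"),
    Token("pressing", "A"), Token("need", "N"),
]
constructions = ["That subclause", "Predicative", "Past tense"]
features = sentence_features(tokens, constructions)
print(sorted(features))
```

Using a set rather than a list records occurrence, not frequency, which matches the treatment of structural features as occurrences; a counter could be swapped in if frequencies were wanted.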
Most K features are based directly on the available information about the morphological or dependency status of a certain word in the sentence, while some other features require aggregating information from several words or from different analysis levels. The palette of K features studied is chosen manually. In this experiment all constructional K features are treated as sentence features, exactly as the lexical I and F features are treated, i.e., no coupling between the features and the words carrying them is performed.

Footnotes to Section 3:

  • Linguistic analysis toolkit: the Connexor Functional Dependency (FDG) parser for English [15].
  • Content words: in this experiment nouns, adjectives, verbs (including verbal uses of participles), adverbs, abbreviations, numerals, interjections, and negation are considered content words.
  • Function words (footnote 4): prepositions, determiners, conjunctions, pronouns, ...

4. TEST DATA

We base our experiment on data used in the NTCIR information retrieval evaluation challenge organised by NII, Tokyo, in its English section of the opinion analysis task. The data have been used by several research groups in a shared task for the last two workshops (NTCIR 6 and NTCIR 7) and we make use of the assessments for this experiment. In comparison, our classifier appears to yield a tie with the reported best result from the shared opinion identification task. (The NTCIR task also involved opinion classification, identifying the polarity of the expressed opinion. We have not attempted that task here: it arguably has a stronger lexical base than that of identifying whether any attitude is expressed or not.)

For generalisation we added to the NTCIR test sentence set the multi-perspective question answering (MPQA) test sentence set with assessed attitudinal sentences [13] and the 2007 Semantic Evaluation Affective Task (SEMEVAL) test set of news headlines [14], both of which have assessments by human judges.
We use a lenient scoring scheme, scoring a sentence as attitudinal if two out of three NTCIR judges have marked it attitudinal; for the SEMEVAL data, a headline is scored as attitudinal if its intensity score is over 50 or under -50. All attitudinal sentences or headlines, irrespective of source, are assigned the class att and all other sentences are assigned the class noatt. Statistics for the collection are given in Table 1. Some sentences from the MPQA and NTCIR test sets, about ten in total, yielded no analyses and were removed from the test set.

5. BACKGROUND WORD SPACE MODEL

Our experiment is based on a background language representation built by analysis of a reasonable-sized general text collection. We then use that model to establish similarities and differences between the sentences under analysis. Our aim is to investigate how the utterance or sentence under consideration is related to language usage in the norm, either by deviation from the norm in some salient way, or by conforming with an identified model of usage.

In this experiment we use one year of newsprint from two Asian English-language news sources, the Korean Times and the Mainichi Daily, with collection sizes as shown in Table 2. The collections are distributed as part of the NTCIR information retrieval evaluation challenge and have been used by several participants for training language models for NTCIR tasks, among them an opinion and attitude analysis task.[6, 11] As a control collection, we use one year of the Glasgow Herald, distributed as part of the CLEF information retrieval evaluation challenge.[2]

For the background text material, we segment the text into sentences and process each sentence to extract the features given above: I, F, and K. This gives us a high-dimensional feature space. We use this to build a cooccurrence-based first-order word space [10, 9], with all three types of features treated alike, using random indexing [7] for dimension reduction.
In this word space, or feature space, each feature is accorded a position in a vector space based on which other features it cooccurs with in the training sentences. Initially, each sentence is given a thousand-dimensional representation vector with two randomly chosen non-null elements {1, −1}. Each feature is also given an initially empty context vector of the same dimensionality. This context vector is trained by scanning through each sentence in turn:

Sentence: “It is this, I think, that commentators mean when they say glibly that the ‘world changed’ after Sept 11.”
  I: be think commentator mean when say glibly world change sept 11
  F: it this i that they that the after
  K: Adverbial of time, Adverbial of manner, That subclause, Predicative, Intransitive clause, Transitive clause, Transitive mix, Present tense, Past tense, Tense shift

Sentence: “President Hafez Al-Assad has said that peace was a pressing need for the region and the world at large and Syria, considering peace a strategic option would take steps towards peace.”
  I: president hafez al-assad have say peace be pressing need region world at large syria consider peace strategic option would take step peace
  F: that a for the and the and a towards
  K: Adverbial of manner, That subclause, Predicative, Intransitive, Transitive clause, Transitive mix, Present tense, Past tense, Tense shift, Verb chain

Sentence: “Mr Cohen, beginning an eight-day European tour including a Nato defence ministers’ meeting in Brussels today and tomorrow, said he expected further international action soon, though not necessarily military intervention.”
  I: mr cohen begin eight-day european tour include nato defence minister meeting brussels today tomorrow say expect international action soon though not necessarily military intervention
  F: an a in and he further
  K: Adverbial of time, Adverbial of place, That subclause, Transitive clause, Past tense

Figure 1: Example attitude analyses of sentences. These sentences are taken from the NTCIR opinion analysis task data set. The first two sentences are assessed by task judges to be opinion carriers, the last non-opinion. The content word feature “say” is a strong marker for opinion but would yield the wrong categorisation in this case; our linear classifier correctly identified the first two sentences as attitudinal and the last as non-attitudinal.

                  NTCIR 6   NTCIR 7   SEMEVAL     MPQA
Attitudinal         1 392     1 075        76    6 021
Non-attitudinal     4 416     3 201       174    4 982
Total               5 808     4 276       250   11 003

Table 1: Test sentence statistics

              Korean Times   Mainichi Daily   Glasgow Herald
Sentences          326 486          123 744        2 158 196
Characters             61M              25M             452M

Table 2: Background text materials
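The random indexing setup of Section 5 can be sketched as follows. The dimensionality (1000) and the two non-null elements (one +1, one −1) come from the text. The update rule is an assumption: the description of how context vectors are trained is cut off in this copy, so the sketch uses the usual random indexing step of adding a sentence's index vector to the context vector of every feature occurring in that sentence. The function names, the seed, and the toy feature strings are hypothetical.

```python
import random

DIM = 1000  # dimensionality stated in the text


def sentence_index_vector(rng, dim=DIM):
    """Sparse random vector: one +1 and one -1 at two distinct random positions."""
    vec = [0] * dim
    pos, neg = rng.sample(range(dim), 2)
    vec[pos] = 1
    vec[neg] = -1
    return vec


def train_context_vectors(sentences, seed=0, dim=DIM):
    """Build one context vector per feature from an iterable of feature sets."""
    rng = random.Random(seed)
    context = {}
    for features in sentences:
        index = sentence_index_vector(rng, dim)
        for feat in features:
            # Each feature starts from an empty (all-zero) context vector.
            ctx = context.setdefault(feat, [0] * dim)
            # ASSUMED update rule: add the sentence's index vector.
            for i, value in enumerate(index):
                if value:
                    ctx[i] += value
    return context


# Toy background "collection": I, F and K features are treated alike.
sentences = [
    {"I:say", "F:that", "K:That subclause"},
    {"I:say", "F:that", "K:Past tense"},
    {"I:tour", "F:an", "K:Past tense"},
]
context = train_context_vectors(sentences)
```

Under this assumed rule, features that occur in the same sentences accumulate the same index vectors and so end up close together in the space, regardless of whether they are content words, function words, or construction markers.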

Publication date: 2009